Digital information is involved in almost every enterprise sector, which gives information a value
that must be protected from leakage. To choose a proper method for protecting sensitive information,
an enterprise must perform a risk analysis of the threat. However, enterprises are often limited in
their ability to measure risk related to information security threats. This paper therefore proposes
an approach for estimating risk by using information value. The techniques used to measure
information value are text mining and the Jaccard method. Text mining is used to recognize
information patterns based on three classes: high business impact, medium business impact and low
business impact. The information is then weighted by the Jaccard method. The weight quantitatively
represents the risk level of information leakage in the enterprise. A comparative analysis with an
existing method shows that the proposed method produces more detailed output when estimating the
risk of an information security threat.

1. INTRODUCTION

Digital information supports the business of enterprises by giving knowledge to users such as staff,
investors, customers and business management. Most digital information is highly sensitive, so a
protection method needs to be applied in the information technology system [1]. However, data leakage
due to flaws in IT systems still occurs and has a serious impact on enterprises. The average cost of
a data leakage incident is about US$3.8 million [2]. To reduce the impact of an incident, an
enterprise must identify the threat and perform mitigation. An appropriate mitigation procedure can
be formulated once the enterprise knows the level of risk in the threat. However, estimating the risk
level is not simple. Enterprises must use a risk model as a reference to calculate the risk level.

In a data leakage case, the risk level can be estimated by a qualitative or a quantitative method.
The importance level of information is commonly used in the qualitative method, and financial value
is used in the quantitative method [3]. The financial approach is difficult to implement because
users must know how information is represented in financial metrics. Users are also required to have
direct access to financial reports. Identifying a new approach for estimating the risk of information
value quantitatively is therefore a new challenge.

One approach to estimating information value is to weight information terms. Pribadi et al. weighted
information value in an automated short answer scoring case [4]. Five classes were used to represent
information value, i.e. highly important term, very important term, important term, fairly important
term and not important term. The term weighting from the study by Pribadi et al. can be adopted to
estimate the risk of information security. Meanwhile, high business impact (HBI), medium business
impact (MBI) and low business impact (LBI) terms can be used as classes in forming the risk level
[5]. Based on these previous studies on weighting and classification of information terms, this
paper aims to develop a new approach for risk estimation in the information security area.

To reach this objective, the paper is divided into several sections. Section 2 reviews previous
studies related to risk estimation and information measurement, and explains the differences between
those studies and the proposed method. Section 3 describes the research method, which contains the
methodology used to achieve the goal. Section 4 gives the experimental details and explains the
process of forming the risk level of an information security threat from the data source. Section 5
presents the results of the experiment and a comparison between the proposed method and a previous
method.


2. RELATED WORK 

In previous studies, information value in enterprises is related to the meaning of information to
the business [3]. In risk analysis, information value has three categories: High Business Impact
(HBI), Medium Business Impact (MBI) and Low Business Impact (LBI) [5]. HBI is data whose leakage has
a severe impact on the information owner and the organization. Information whose leakage causes
reputation damage is included in the MBI category, whereas LBI is information that has limited
impact on the information owner or the organization.

Several methods have also been used to estimate information value in previous studies. Sajko et al.
developed a method to measure information value by calculating the volume of information. The volume
is calculated from three variables: the meaning of information in business, time, and the cost of
producing information [3]. The dimensions of the volume of information are presented in Figure 1.


Figure 1. Dimension of information volume (axis labels include time (t) and cost for producing
information (c))


Information value was represented by the volume of information ($V_{inf}$). The volume of
information was measured with three variables: the meaning of information in business (m), time (t)
and the cost of producing information (c). The relation between the volume of information and its
variables is described in Equation 1.

$$\text{Information Value} = V_{inf}\{m, t, c\} \tag{1}$$

The weight of each variable in Equation 1 was obtained by a survey. The assessment tool was a
questionnaire in which experts, as respondents, chose an ordinal or interval value to represent the
weight of each variable.

However, using expert opinion to fill in the weights of the variables makes the estimate of
information value subjective. Gao et al. therefore developed a new approach to estimating
information value, arguing that expert opinion in assessment was an old practice that increased
operational complexity [6]. Gao et al. used a fuzzy algorithm and a clustering method. The fuzzy
algorithm was used to quantify information risk factors. The results of the fuzzy processing were
then divided into four clusters, L1, L2, L3 and L4, where L1 represents the minimum value and L4 the
highest value. The clustering method used by Gao et al. was K-Means. In comparison, the method of
Gao et al. is more objective than the method of Sajko et al. However, the method of Gao et al. needs
a minimum amount of data as training data in the early process.

This paper uses a different approach to view the business risk of enterprises. The number of leaked
data items is the basis for estimating risk: an enterprise has high risk if a large number of its
leaked data items involve sensitive information. Text mining and the Jaccard method are therefore
central to the new risk analysis approach developed in this paper. Text mining is used to classify
the sensitivity of information and the Jaccard method is used to calculate the weight of
information. Text mining has previously been used to classify information in Data Leakage/Loss
Prevention (DLP) [7], [8], whereas the Jaccard method has been used for weighting in the similarity
function of information retrieval systems [9], [10] and in plagiarism detection [11]. In this paper,
the Jaccard method is applied to estimate the value of information in a document by counting the
specific terms that represent a category [4].


3. RESEARCH METHOD 

Text mining is a proper mechanism for processing unstructured data [12]. It can be used as a
classification function that categorizes sets of strings and assigns each appropriate word to a
category [13]. Regular expressions are a technique for recognizing the words of a category by
pattern matching or keyword matching. The steps of the research in this paper are shown in Figure 2:



Figure 2. Steps of research method 


In the pre-processing step, the sets of strings in the document are processed through lowercase
conversion, punctuation removal, stemming, tokenization and stopword removal. The pre-processing
step prepares the data source document so that it can be processed in the filtering step.
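
The pre-processing chain can be sketched as follows. This is a minimal illustration, not the
implementation used in the experiment: the stopword list, the suffix rules and the function names
are hypothetical, the stemmer is a crude stand-in for a real one, and the sketch assumes that
stopwords are dropped before stemming is applied to each token.

```python
import string

# Hypothetical stopword list and suffix rules, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Lowercase, remove punctuation, tokenize, drop stopwords, then stem."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()  # whitespace tokenization
    kept = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in kept]


print(preprocess("The Passwords, and the Addresses!"))  # ['password', 'address']
```

In practice an established stemmer would replace `naive_stem`; the suffix rules here only keep the
example free of external dependencies.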

The filtering step defines the categories and criteria. This paper follows the categories and
criteria of Ruivo et al. [5]:

• High Business Impact 

It consists of words: passwords, bank account, credit card number 

• Medium Business Impact 

It consists of words: information of customer specification 

• Low Business Impact 

It consists of words: gender, address 

The categories and criteria are implemented in a wordlist. The regular expression technique refers
to that wordlist to recognize terms in the data source document [14].
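
A wordlist-driven regular expression filter could look like the sketch below. The terms come from
the categories listed above; the dictionary layout, the whole-word matching rule and the function
name `classify_terms` are assumptions made for illustration, and the sketch matches against raw
text rather than pre-processed tokens.

```python
import re

# Wordlist taken from the categories above; the layout is hypothetical.
WORDLIST = {
    "HBI": ["passwords", "bank account", "credit card number"],
    "MBI": ["information of customer specification"],
    "LBI": ["gender", "address"],
}

# One case-insensitive pattern per category, matching whole words or phrases.
PATTERNS = {
    category: re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    for category, terms in WORDLIST.items()
}


def classify_terms(text: str) -> dict[str, list[str]]:
    """Return the wordlist terms found in the text, grouped by category."""
    return {
        category: [match.group(0).lower() for match in pattern.finditer(text)]
        for category, pattern in PATTERNS.items()
    }


print(classify_terms("Leaked: passwords, gender and Address of staff"))
# {'HBI': ['passwords'], 'MBI': [], 'LBI': ['gender', 'address']}
```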

The categorized words are processed in the analysis step by estimating a weight for every word in a
category. The total weight of a category is the sum of the weights of the words in that category.
The Jaccard method is used to estimate the weight of each word in a category. It calculates the
ratio between the number of occurrences of words defined in the wordlist (W) and the total number of
unique words in the document (D) [15]. The result of the Jaccard method is a coefficient. The
Jaccard coefficient, Jaccard(W, D), is the size of the intersection (W ∩ D) divided by the size of
the union (W ∪ D), as given in Equation 2.

$$\text{Jaccard}(W, D) = \frac{|W \cap D|}{|W \cup D|} \tag{2}$$
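
The standard set form of the Jaccard coefficient can be written in a few lines. The sketch below
treats W and D as sets of terms; the example terms and the convention of returning 0.0 for two empty
sets are assumptions for illustration.

```python
def jaccard(w: set[str], d: set[str]) -> float:
    """Jaccard coefficient |W ∩ D| / |W ∪ D| for two sets of terms."""
    union = w | d
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(w & d) / len(union)


# Hypothetical example terms.
wordlist_terms = {"password", "gender", "address"}
document_terms = {"password", "address", "status", "login"}
print(jaccard(wordlist_terms, document_terms))  # 2 shared terms / 5 terms in total = 0.4
```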

4. EXPERIMENTAL DETAILS

This paper uses information disclosed in an information security assessment as the data source. To
obtain the disclosed information, an SQL injection attack is used as the assessment method. Table 1
shows the result of the information security assessment.


Table 1. Experimental Data Source

username  password                          lastlogin                    status  email
User1     af59b75d998d4e6869caea0b22bc8f5c  15th December 2015 03:17PM   active  email1
User2     b2805c093f83761e5aba2a145067ddc7  2nd November 2015 03:34AM    active  email2
User3     b2805c093f83761e5aba2a145067ddc7  5th November 2015 01:07AM    active  email3
User4     8e721d1c51f5109c989c77d9275fcf61  1st November 2015 12:24PM    active  email4
User5     b2805c093f83761e5aba2a145067ddc7  6th November 2015 09:23AM    active  email5
User6     b2805c093f83761e5aba2a145067ddc7  8th September 2015 09:40AM   active  email6
User7     b2805c093f83761e5aba2a145067ddc7  16th October 2015 09:42PM    active  email7
User8     b2805c093f83761e5aba2a145067ddc7  28th August 2015 11:15PM     active  email8
User9     b2805c093f83761e5aba2a145067ddc7  3rd October 2015 06:22AM     active  email9
User10    b2805c093f83761e5aba2a145067ddc7  27th December 2014 03:42AM   active  email10
User11    b2805c093f83761e5aba2a145067ddc7  29th December 2014 07:00AM   active  email11
User12    b2805c093f83761e5aba2a145067ddc7  19th April 2005 04:15PM      active  email12
User13    b2805c093f83761e5aba2a145067ddc7  19th April 2005 06:26PM      active  email13
User14    b2805c093f83761e5aba2a145067ddc7  13th October 2015 03:56AM    active  email14
User15    b2805c093f83761e5aba2a145067ddc7  20th April 2005 11:11AM      active  email15
User16    b2805c093f83761e5aba2a145067ddc7  24th November 2014 03:32AM   active  email16
User17    b2805c093f83761e5aba2a145067ddc7  22nd April 2005 11:14AM      active  email17
User18    b2805c093f83761e5aba2a145067ddc7  27th December 2014 03:44AM   active  email18
User19    b2805c093f83761e5aba2a145067ddc7  26th April 2005 02:24PM      active  email19
User20    b2805c093f83761e5aba2a145067ddc7  22nd April 2005 01:46PM      active  email20
User21    b2805c093f83761e5aba2a145067ddc7  24th April 2005 12:50PM      active  email21
User22    b2805c093f83761e5aba2a145067ddc7  25th April 2005 10:15AM      active  email22
User23    b2805c093f83761e5aba2a145067ddc7  05th February 2006 02:04PM   active  email23
User24    b2805c093f83761e5aba2a145067ddc7  12th November 2014 03:49AM   active  email24
User25    b2805c093f83761e5aba2a145067ddc7  6th November 2014 05:47AM    active  email25

*) users and emails are censored for security reasons


Based on the data format of Table 1, the high business impact (HBI) category is represented by the
columns “username” and “password”, whereas the low business impact (LBI) category is represented by
the column “email”. Other content is defined as public information that has no impact on the
company's business, because public information is intended for public readers. The classification
procedure is described in Algorithm 1.


Algorithm 1: Identify the unique words that belong to the categories

Notation:
    document    : list of data
    wordlist    : list of strings defining the token categories
    token       : word from the output of text mining
    tokens      : list of tokens
    num_tokens  : number of tokens

Input: file
Var:
    document, wordlist, token, tokens : string
    num_tokens : integer

Begin
    if (token ∈ wordlist) == true then
        for i = 1 to N do
            tokens ← read(token ∈ document)
        end for
    end if
    num_tokens ← count(tokens)
End

Output:
    $$tokens = \sum_{i=1}^{N} token(i)$$
    $$num\_tokens = \sum tokens$$
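
Algorithm 1 can be sketched in Python as below. The function name and the example data are
hypothetical, and the sketch assumes that "unique" means each matching token is kept once, in order
of first appearance.

```python
def identify_tokens(document: list[str], wordlist: set[str]) -> tuple[list[str], int]:
    """Collect the unique document tokens that appear in the wordlist, and count them."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique_matches = dict.fromkeys(token for token in document if token in wordlist)
    tokens = list(unique_matches)
    return tokens, len(tokens)


# Hypothetical example: "password" occurs twice but is counted once.
document = ["password", "status", "address", "password"]
wordlist = {"password", "address"}
print(identify_tokens(document, wordlist))  # (['password', 'address'], 2)
```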


5. RESULTS AND ANALYSIS 

The data source in Table 1 contains two categories: high business impact and low business impact.
High business impact (HBI) consists of username and password information, while email is categorized
as low business impact (LBI) information. Figure 3 shows the data distribution across the resulting
categories.



Figure 3. Data distribution from classification 


The intersection and union sizes are calculated from the data classification of the previous
process. This yields three intersection values and one union value. Table 2 gives the results of the
intersection and union calculation.


Table 2. Jaccard Variables from Data Source

Variable                                                                        Value
Intersection size between username and document from data source                25
Intersection size between password and document from data source                3
Intersection size between email and document from data source                   25
Union size between (username, password, email) and document from data source    79
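
One reading that reproduces the values in Table 2 is to count distinct values: the intersection
sizes are the numbers of distinct usernames, passwords and emails, and the union size is the number
of distinct values across all columns of Table 1. This is an inference from the numbers, not a
procedure stated in the text. The sketch below uses synthetic placeholder rows with the same shape
as Table 1 (25 users, 3 distinct password hashes, 25 distinct login times, one status value, 25
emails).

```python
# Synthetic stand-ins with the same shape as Table 1, not the censored data.
rows = [
    {
        "username": f"User{i}",
        "password": "hash_a" if i == 1 else "hash_c" if i == 4 else "hash_b",
        "lastlogin": f"login_time_{i}",
        "status": "active",
        "email": f"email{i}",
    }
    for i in range(1, 26)
]


def unique_values(column: str) -> set[str]:
    """Distinct values that appear in one column."""
    return {row[column] for row in rows}


# Intersection sizes: distinct values of each sensitive column.
sizes = {col: len(unique_values(col)) for col in ("username", "password", "email")}

# Union size: distinct values across every column of the document.
union_size = len({value for row in rows for value in row.values()})

print(sizes, union_size)  # {'username': 25, 'password': 3, 'email': 25} 79
```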


The Jaccard coefficient is obtained by dividing the intersection size of each category by the union
size of the document. For the high business impact (HBI) category, the Jaccard coefficient is the
combined intersection of username and password over the union of the document. The Jaccard
coefficient of the low business impact (LBI) category is the intersection of email over the union of
the document. The results of the Jaccard coefficient calculation are shown in Table 3.


Table 3. Jaccard Coefficient

Jaccard Variables                     Jaccard Coefficient   Category
Jaccard((username, password), doc)    0.354                 HBI
Jaccard(email, doc)                   0.316                 LBI

The Jaccard coefficient represents the weight of risk for each category. In the case above, about
67% of the data disclosed from the enterprise is sensitive information, consisting of 35.4% high
business impact and 31.6% low business impact information. The measurement will give different
output when applied to different data from different enterprises, because the value and
characteristics of information differ in each enterprise.
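
As a check on the arithmetic, the coefficients in Table 3 follow from the sizes in Table 2. The
subscripted names below are shorthand introduced here for the two category coefficients.

```latex
\begin{align*}
\mathrm{Jaccard}_{HBI} &= \frac{25 + 3}{79} = \frac{28}{79} \approx 0.354 \\
\mathrm{Jaccard}_{LBI} &= \frac{25}{79} \approx 0.316 \\
\mathrm{Jaccard}_{HBI} + \mathrm{Jaccard}_{LBI} &= \frac{53}{79} \approx 0.671
\end{align*}
```

The sum corresponds to the share of about 67% of sensitive information in the disclosed data.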

To characterize the result of the proposed method, this paper makes a comparative analysis with an
existing method for estimating risk. In the general method, information security risk is measured
by two variables, probability and impact [16]. The relation between risk, probability and impact is
described in Equation 3.

$$\text{Risk} = \text{Probability} \times \text{Impact} \tag{3}$$

The Open Web Application Security Project (OWASP) risk rating is one method that uses the relation
between risk, probability and impact to measure risk [17]. In the OWASP risk rating, probability is
represented by the likelihood variable, and impact consists of technical impact and business impact.
The OWASP risk rating also considers the business perspective when estimating risk, so its approach
is similar to that of the proposed method. The OWASP risk rating is therefore chosen as the
comparison method in the comparative analysis. Based on the data in Table 1, the OWASP risk rating
gives a medium risk level for both technical impact and business impact. The results of the OWASP
risk rating are shown in Table 4.


Table 4. Measurement Result of OWASP Risk Rating

Aspects     Likelihood   Impact   Risk
Technical   Medium       Medium   Medium
Business    Medium       Medium   Medium
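
For readers unfamiliar with the OWASP risk rating, the sketch below shows how likelihood and impact
levels combine into an overall severity, following the published OWASP Risk Rating Methodology (0 to
9 scores, with levels Low below 3, Medium below 6 and High otherwise). The function names and the
example scores are hypothetical; the scores behind Table 4 are not given in this paper.

```python
def level(score: float) -> str:
    """Map a 0-9 OWASP factor score to a level."""
    if not 0 <= score <= 9:
        raise ValueError("score must be between 0 and 9")
    if score < 3:
        return "Low"
    if score < 6:
        return "Medium"
    return "High"


# Overall severity keyed by (likelihood level, impact level).
SEVERITY = {
    ("Low", "Low"): "Note",
    ("Medium", "Low"): "Low",
    ("High", "Low"): "Medium",
    ("Low", "Medium"): "Low",
    ("Medium", "Medium"): "Medium",
    ("High", "Medium"): "High",
    ("Low", "High"): "Medium",
    ("Medium", "High"): "High",
    ("High", "High"): "Critical",
}


def owasp_severity(likelihood_score: float, impact_score: float) -> str:
    """Combine likelihood and impact scores into an overall severity."""
    return SEVERITY[(level(likelihood_score), level(impact_score))]


# Hypothetical scores: medium likelihood and medium impact give medium risk.
print(owasp_severity(4.5, 4.0))  # Medium
```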


To show the result of the comparative analysis, this paper uses three categories: method, aspect and
experimental result. The comparison between the proposed method and the OWASP risk rating is shown
in Table 5.


Table 5. Comparative Result

                      Proposed Method                             OWASP
Method                Text Mining + Jaccard Method                Probability x Impact
Aspect                Business                                    Business, Technical
Experimental Result   High Business Impact, Low Business Impact   Medium


In the comparative result, the proposed method and the OWASP risk rating take different approaches
to estimating the risk from an information security threat. The two methods also give different risk
levels in the business aspect. However, the proposed method has the advantage of producing a more
detailed risk estimate because it examines each piece of information in the disclosed data. This
differs from the OWASP approach, which views an information security threat generally and with
subjective measurement. In previous research, the Jaccard method was faster than the Cosine Distance
algorithm in filtering data [18], which is another advantage of the proposed method.


6. CONCLUSION 

This paper proposes a different perspective for measuring the risk of information disclosure: it
uses information value to determine that risk. Information value is estimated by a classification
and weighting process. Sensitive information from a data leakage is classified as High Business
Impact (HBI), Medium Business Impact (MBI) or Low Business Impact (LBI). Text mining based on
keyword and pattern matching is used to classify the sensitive information. A weight is assigned to
each class by the Jaccard method, whose calculation involves the intersection and union sizes, and
the weight describes the risk level quantitatively. In the experiment, the data source from an
organization yields two categories of impact, i.e. high business impact and low business impact.
High business impact has a weight of about 0.354 and low business impact has a weight of about
0.316. The experimental result states that the leaked data contains 35.4% highly sensitive
information and 31.6% low-sensitivity information. To identify the advantages of the proposed
method, a comparative analysis is performed against the OWASP risk rating. The comparison leads to
the conclusion that the proposed method gives more detailed output.